Quiz on GFS

Test your understanding of concepts related to the design of Google File System via a quiz.

Question 5

How can a client find the right chunk in the presence of padding?

Because of padding and record duplicates, the application-level file size is less than or equal to the number of bytes the system uses to store that file. To read a particular data byte, the client has to find the chunk that contains it. Since the chunks may hold padding and record duplicates, dividing the data byte offset by the chunk size does not necessarily land on the right chunk.
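To make the mismatch concrete, here is a toy sketch in Python. The chunk layout, the 64-byte chunk size (GFS uses 64 MB chunks), and the function names are all assumptions made up for illustration; this is not how GFS stores or exposes chunk contents.

```python
CHUNK_SIZE = 64  # toy value in bytes, chosen only to keep the numbers small

# Hypothetical layout: each chunk is a list of (kind, length) segments.
# "data" is application data, "pad" is padding, "dup" is a duplicate record.
CHUNKS = [
    [("data", 40), ("pad", 24)],
    [("data", 30), ("dup", 30), ("pad", 4)],
    [("data", 64)],
]

def naive_chunk_index(logical_offset):
    """Chunk index obtained by plain division, ignoring padding and duplicates."""
    return logical_offset // CHUNK_SIZE

def actual_chunk_index(logical_offset):
    """Chunk index found by walking the chunks and counting only data bytes."""
    data_bytes_seen = 0
    for index, segments in enumerate(CHUNKS):
        for kind, length in segments:
            if kind != "data":
                continue  # padding and duplicates do not advance the logical offset
            if logical_offset < data_bytes_seen + length:
                return index
            data_bytes_seen += length
    raise ValueError("offset is past the end of the application data")

print(naive_chunk_index(75), actual_chunk_index(75))
```

In this layout, application byte 75 lives in the third chunk (index 2), while plain division points at the second chunk (index 1).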

If a client is reading a file sequentially from start to end, it iterates over all the chunks anyway, so there is no need to find a specific chunk. While reading, the client identifies the padding and record duplicates, discards them, and returns only the actual data bytes to the end user.

The padding and record duplicates can be identified using checksums and special markings that are stored with the data.
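A minimal sketch of such filtering on the reader side follows. The record framing (payload length, record ID, and CRC32 in a 16-byte header), the zero-byte padding, and the names `encode_record` and `read_valid_records` are assumptions for illustration only; GFS leaves the record format to the application.

```python
import struct
import zlib

# Hypothetical header: payload length (4 bytes), record ID (8 bytes), CRC32 (4 bytes).
HEADER = struct.Struct(">IQI")

def encode_record(record_id, payload):
    """Frame a payload so that a reader can later validate and deduplicate it."""
    return HEADER.pack(len(payload), record_id, zlib.crc32(payload)) + payload

def read_valid_records(raw):
    """Yield each unique, checksum-valid payload; skip padding and duplicates."""
    seen_ids = set()
    pos = 0
    while pos + HEADER.size <= len(raw):
        length, record_id, crc = HEADER.unpack_from(raw, pos)
        start = pos + HEADER.size
        payload = raw[start:start + length]
        is_valid = length > 0 and len(payload) == length and zlib.crc32(payload) == crc
        if is_valid:
            if record_id not in seen_ids:  # a repeated ID marks a duplicate record
                seen_ids.add(record_id)
                yield payload
            pos = start + length
        else:
            pos += 1  # no valid record here: treat the byte as padding and resync

raw = (
    encode_record(1, b"alpha")
    + b"\x00" * 7                 # padding
    + encode_record(2, b"beta")
    + encode_record(2, b"beta")   # duplicate left behind by a retried append
    + encode_record(3, b"gamma")
)
print(list(read_valid_records(raw)))
```

The checksum lets the reader reject regions that are not well-formed records, and the record ID lets it drop a record it has already returned.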

If a client starts a sequential read from a random offset, or performs a small random read, it needs to know the chunk or chunks containing the requested data bytes. The client doesn’t know how much padding or how many record duplicates are present in the file’s chunks, so it can neither compute the right chunk directly nor start its search from an estimated chunk. It has to iterate over the chunks from the start until it finds the one holding the requested data byte, which is very costly if done on every read. It is up to the applications to decide how to tackle this problem on their side.

An application might embed hints in the records as it writes them, so that later readers can approximate the application-level byte index.
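One possible shape for such hints is a sparse list of checkpoints that map an application-level offset to a physical position. The sketch below assumes the writer has recorded these checkpoints somewhere the reader can load them; the `HINTS` table and the `nearest_hint` function are hypothetical and not part of GFS.

```python
import bisect

# Hypothetical checkpoints written by the application, sorted by logical offset:
# (application-level offset, chunk index, byte offset within that chunk)
HINTS = [
    (0, 0, 0),
    (40, 1, 0),
    (70, 2, 0),
]
HINT_OFFSETS = [hint[0] for hint in HINTS]

def nearest_hint(logical_offset):
    """Return the last checkpoint at or before the requested logical offset."""
    i = bisect.bisect_right(HINT_OFFSETS, logical_offset) - 1
    if i < 0:
        raise ValueError("no checkpoint at or before this offset")
    return HINTS[i]

print(nearest_hint(75))
```

The reader would jump to the chunk named by the checkpoint and scan forward from there, skipping padding and duplicates, instead of scanning from the start of the file. Because the file may still be changing, the checkpoint is only a starting point for the scan, not an exact answer.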

We encourage you to brainstorm other ways to find out the application-level byte index efficiently when the underlying file might be mutating concurrently.
